The Topology of Pseudoknotted Homopolymers 
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We consider the folding of a self-avoiding homopolymer on a lattice, with saturating hydrogen 
bond interactions. Our goal is to numerically evaluate the statistical distribution of the topological 
genus of pseudoknotted configurations. The genus has been recently proposed for classifying pseu- 
doknots (and their topological complexity) in the context of RNA folding. We compare our results 
on the distribution of the genus of pseudoknots, with the theoretical predictions of an existing com- 
binatorial model for an infinitely flexible and stretchable homopolymer. We thus obtain that steric 
and geometric constraints considerably limit the topological complexity of pseudoknotted configu- 
rations, as it occurs for instance in real RNA molecules. We also analyze the scaling properties at 
large homopolymer length, and the genus distributions above and below the critical temperature 
between the swollen phase and the compact-globule phase, both in two and three dimensions. 
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One of the most exciting fields in modern computa- 
tional molecular biology is the search for tools predict- 
ing the complex foldings of bio-polymers such as RNA 
mij>|j|> when homologous sequences are not available. 
The prediction of the full tertiary structure of a RNA 
molecule is still an open issue mainly because of its 
intrinsic high computational complexity It is known 
that the tertiary structure involves an important set of 
structural motifs, the so-called pseudoknots These 
are conformations such that the associated disk diagram 
(which represents all nucleotides along the RNA back- 
bone as points on an oriented circle from the 5' end to 
the 3' end, and where each base-pair is represented by 
an arc joining the two interacting nucleotides, inside the 
circle; see Figure ^ is not planar, i.e. it contains inter- 
secting arcs. RNA pseudoknots have been identified in 
nearly every organism, and they proved to play impor- 
tant regulatory and functional roles Their ubiquity 
manifests in a large variety of possible shapes and struc- 
tures 01, and their existence should not be neglected in 
structure prediction algorithms, as they account for the 
10%-30% on average of the total number of base pairs. 
Actually, several computer programs have been proposed 
for predictin g R NA secondar y st ructures including pseu- 
doknots ilQ,lll E H El 111 (the list is not ex- 
haustive), but the complexity of the problem and the 
approximations involved are usually such that the issue 
is far from being solved 17] . 

An analytic mathematical tool which can fully describe 
any RNA contact structure including all possible pseu- 
doknots, appeared first in ^l]- There, all RNA disk dia- 
grams are considered as Feynman diagrams of a suitable 
field theory oi N x N Hermitian matrices (a combinato- 
rial tool borrowed from quantum field theory). The lat- 
ter is known to organize all the diagrams according to an 
asymptotic topological expansion at large- TV [l^ . 
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FIG. 1: Top, from left to right: a hairpin loop (PDB num- 
ber 1NA2), its squiggle-plot and disk diagram representation 
which is of genus zero since it is planar. Bottom: H-type 
pseudoknot (PDB number IRNK). In this case the disk dia- 
gram is not planar and has genus one. 



This provides in fact a rigorous way to classify non-planar 
diagrams, and therefore it induces a natural topological 
classification of pseudoknots ^3- Namely, to any given 
pseudoknotted configuration (and more generally, to any 
contact structure of an heteropolymer with binary satu- 
rating interactions), one can associate an integer number 
g, the genus. It is defined as the topological genus of the 
associated disk diagram, i.e. by x = 1 — 2f; where x is the 
Euler characteristic number of the diagram. As reviewed 
in [2I1, the genus is the minimum number of handles 
the disk should have in order that all the cords are not 
intersecting (see Figure QJ. Other characterizations of 
pseudoknots have been proposed (e.g. 0,112, H^). The 
classification ^3 is truly topological, meaning that it is 
independent from the way the diagram is drawn, and de- 
pendent only on the intrinsic complexity of the contact 
structure. 

The large- iV asymptotics of the analytical model in ^l] 
is hard to obtain exactly. However, in 0] a special case 
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of the general model 18] has been considered and solved. 
It was the simple case of an infinitely flexible and stretch- 
able homopolymer, where there is no dependence on the 
primary sequence, and any saturating base pair between 
all the "nucleotides" is allowed. An analytical asymptotic 
expansion was evaluated and the distribution of the genus 
of pseudoknotted contact structures was obtained. One 
of the results is that an homopolymer with L nucleotides 
has an average genus close to the maximal one, that is 
L/4. Of course, real RNA molecules are not infinitely 
flexible and stretchable homopolymers. It is customary 
to assume that the bases i and j can interact only if they 
are sufficiently far apart along the chain (e.g. |i — i| > 4, 
) because of bending rigidity. Moreover helices have a 
long persistence length 200 base pairs) and this neces- 
sarily constrains the allowed pairings even more. We ex- 
pect that including all steric and geometrical constraints 
should considerably decrease the genus of allowed pseu- 
doknots, compared to the purely combinatorial case 
where the actual three-dimensional conformation was ne- 
glected. The purpose of this Letter is to numerically an- 
alyze the effects of steric and geometric constraints on 
the genus distribution of pseudoknots topologies in ho- 
mopolymers, in the same spirit of [23 |. 

The Model: we model the system by considering a 
polymer on a cubic lattice, i.e. a self-avoiding random 
walk with short-range attractive interaction [23|. A self- 
avoiding walk (SAW) is a sequence of neighboring lattice 
sites i = 0, 1, . . . , n with coordinates {r^}, such that the 
same lattice-site cannot be visited more than once. This 
is a standard approach in polymer physics (and RNA: see 
e-g- 1^11^)' The attractive interaction is usually used 
to describe bad solvent quality, but in our case we in- 
sist more on the saturating nature of the hydrogen bond 
interactions. Such a requirement is crucial here, since 
the concept of the topological genus for a contact struc- 
ture can be defined unambiguously only when the inter- 
actions are saturating. One of the most natural ways to 
model the interaction is by considering a "spin" model 
(see e.g. H^EEE^)- Strictly speaking, our model is a 
variation of the standard 0-polymer model, and similar 
interaction models for RNA on the lattice have been al- 
ready proposed (e.g. [13 )■ To each vertex i we associate 
a unit spin which represents the nucleotide direction 
with respect to the backbone. The only allowed direc- 
tions for Si are the lattice ones. Moreover, the spins can- 
not overlap with the backbone because of the excluded 
volume between the nucleotides and the backbone. The 
saturating nucleotide-nucleotide interaction occurs when 
two spins Si,Sj on neighboring sites, jr^ — rj| — 1, are 
pointing to each other. The energy of a configuration 
{ri. Si] is thus defined by the Hamiltonian: 

7i = -e^ 6{vi + Si -~Yj)5{si + Sj)5{\vi - r^l - 1) , (1) 
where e > is an effective hydrogen-binding energy, the 



same for all monomers of the chain. Let us note that 
since we are not aimidgng to set up a realistic lattice 
model for RNA-folding, but rather to understand steric 
effects on the genus distributions of a homopolymer, we 
do not take into account stacking energies. 

The basic features of our model are clear: At high tem- 
peratures, we expect the system to be in a swollen SAW 
state (entropy dominated coil state), whereas at lower 
temperatures we expect a kind of "compact globule" -like 
phase j2nl |. The transition temperature Tg defines the 
so-called 0-point. However, details on the thermodynam- 
ics, kinetics, phase diagram, etc. can be rather complex 
[23I E^ . We limit ourselves here only to the analysis 
of the genus distribution of pseudoknotted structures for 
comparing the effects of stericity constraints versus the 
purely combinatorial model of |2J|. All other considera- 
tions are postponed elsewhere. 

The Method: The numerical sampling of the statisti- 
cal distribution Z = X^SAW {si} 6xp(— Ti/fcsT), where ks 
is Boltzmann's constant, T is the absolute temperature, 
and the sum is restricted to SAWs and configurations of 
spins {si} satisfying the aforementioned constraints, is 
implemented by using the Monte Carlo Growth Method. 
It was originally proposed by T. Garel and H. Orland 
in 30j and has been applied to several statistical sys- 
tems since then (see references in [sJl). It consists in 
starting with an ensemble of chains at equilibrium and 
then growing each chain by adding one monomer at a 
time with a probability proportional to the Boltzmann 
factor for the energy of the chain. At each step the en- 
semble remains at equilibrium (a detailed description of 
the algorithm with applications can be found in |3ll|'). 
It belongs to the family of so-called "population Monte 
Carlo algorithms" [s^, where, contrary to the "dynam- 
ical" Markov Chain Monte Carlo methods, the popula- 
tion is fully grown and evolved, non-dynamically. At high 
temperatures we considered populations with a variable 
number of chains in the range 10000-40000, and with a 
typical length of L = 500 monomers (up to L = 1200 
in some cases). Accuracy and statistical averages were 
computed by taking several independent populations (of 
the order of 40). At low temperatures we considered pop- 
ulations of up to 100000 chains. All the simulations have 
also been performed on a square lattice in two dimen- 
sions. 

Results and discussion: We expect different genus dis- 
tributions above and below Tg. We therefore first de- 
termine Tg , which can be done efficiently by computing 
the end-to-end distance i?g = (r^ — Tq)^, and the ra- 
dius of gyration i?^ — X]i<j(''i ~ '^jY l^"^- It is known 
that the ratio = {Rl)/{Rg) is universal in the hmit 
L — > 00 and converges to a step function as a function 
of T, with a universal critical value at Tg 

mUllli In 

Figure 121 we plot p which shows a transition temperature 
Te = 0.39 ± 0.01 {T^^ = 0.48 ± 0.02 in two dimensions). 
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in units where fcs = 1 and e = 1. We also verified that 




FIG. 2: The ratio p in 3d as a function of the temperature T, 
for several values of L (with error bars plotted only for L = 
60). At low temperatures the error bars are larger than the 
variations of the curves. The inset shows that as L increases, 
the curves approach a universal step function about Te ~ 
0.39 ±0.01. 

Poo = 2.5 ± 0.05 for T ^ Tg asymptotically, and we 
find an intermediate value pg — 2.35 ± 0.05 at Tg (and 
Poo = 2.67 ± 0.01 and pg = 2.39 ± 0.01 in 2d, respec- 
tively). At large-L we find the following scalings: for 
T > Tg, (Rg) - L" with ly = 0.59 ±0.01 {u = 0.75 ±0.02 
in two dimensions), which is consistent with the critical 
exponent of a swollen SAW; for T < Te, = 0.32 ± 0.02 

= 0.50 ± 0.01 in two dimensions) which is consistent 
with a compact phase. All these results are in agreement 
with high-accuracy simulations of similar models 
We then proceed with extracting the genus distributions 
in the two phases. The results are in Figure 13 and 01 
When comparing them with the combinatorial results of 
[i^ l. we see that the genus at a fixed L is on the aver- 
age much smaller. More precisely, below the 6'-point the 
average genus scales like by (g/L) ^ 0.141 ± 0.003 and 
(g/L) - 0.1318 ± 0.0025, in 3d (at T = 0.2) and 2d (at 
T = 0.225), respectively. In both cases the scaling is at a 
lower rate (about 50% less) than the value L/A computed 
in 0. In the swollen-phase (e.g. T ^ 10 > Tg), the av- 
erage genus is given by (g/L) ~ (585 ± 8)10^® in 3d, and 
(g/L) ~ (410± 1)10^*^ in 2d. Such a low rate comes from 
the tendency of a homopolymer to develop long rectilin- 
ear sub-chains in the swollen phase. In two dimensions 
the entropic factor is smaller than in three dimensions 
and the genus growth rate is therefore larger (see Fig- 
ure 01). Moreover, the genus distributions for T > Tg 
are numerically consistent with Poissonian distributions 
(see Figure EJ, whereas at smaller temperatures they are 
closer to Gaussian ones. 

It turns out that the average genus is an extensive 
quantity, like the energy, and their ratio is shown in 
Figure |S1 All these results confirm that the genus dis- 
tribution behaves differently in the two phases, as ex- 
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FIG. 3: The genus distributions of pseudoknots at fixed L, in 
2d (top) and 3d (bottom), in the compact phase at T = 0.225 
and T = 0.2, respectively. The insets represent the behavior 
of {g) at large-L. 
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FIG. 4: The genus distribution of a two dimensional ho- 
mopolymer, in the swollen phase T = 10 > T^^, at various 
values of L. 

pected. They also quantify how much the restrictions 
induced by the actual three-dimensional arrangement of 
the chain can limit the number and complexity of pseudo- 
knots (compared to [l^l). We find values closer to what 
seems to happen in pseudoknots of real RNA molecules. 
In fact, real RNA molecules typically have small genus. 
For instance, a simple -ff-type pseudoknot (~20 bases or 
more) or the classical kissing-hairpins pseudoknot (^30 





-0.15 
-0.3 







L = 500 _ 




L = 300 




L = 180 




L = 100 




L = 60 



0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 T 



FIG. 5: The ratio genus/energy of an homopolymer on a cubic 
lattice, as a function of T, at different values of L. 

bases or more) both ha ve g enus 1, much less than the 
toy-model prediction in [2J|. Even tRNAs (~80 bases) 
mostly contain 4 helices, two of them linked together by 
a kissing-hairpin pseudoknot, has still genus 1. Typical 
tmRNAs (~350 bases long) contain four i?-type pseudo- 
knots, and its total genus is 4, far below the theoretical 
upper bound L/4. Our numerical results would instead 
indicate, for instance for a 80 bases long homopolymer, 
a genus of about 11.2 in three dimensions (^10.5 in two 
dimensions) . Even if it is smaller than the value sug- 
gested in (because of the steric constraints) , it is still 
too high when compared to real RNA molecules. The 
obvious reasons are that we neither included the primary 
sequence nor realistic stacking energies. We have never- 
theless been able to quantify the general effect of steric 
constraints on the genus distribution of a pseudoknotted 
homopolymer on a lattice, as a first step towards a model 
which includes a more realistic energy function. 
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